The data consists of 113,937 loans with 81 variables on each loan including loan amount, borrower(interest) rate, current loan status, borrower income, and many others. Some variables had to be simplified or summarized into smaller categories to make the analysis clearer and understandable. There are also so many variables, it would be too cumbersome to investigate them all. So the study focused on a few variables.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
# pd.options.display.max_rows = 100
pd.options.display.max_columns = 100
# Loan dataset
loans = pd.read_csv("prosperLoanData.csv")
# high-level overview of data shape and composition
print(f"Number of Columns : {loans.shape[1]}")
print(f"Number of Rows : {loans.shape[0]}")
Number of Columns : 81 Number of Rows : 113937
# Observe data types
loans.dtypes
ListingKey object
ListingNumber int64
ListingCreationDate object
CreditGrade object
Term int64
...
PercentFunded float64
Recommendations int64
InvestmentFromFriendsCount int64
InvestmentFromFriendsAmount float64
Investors int64
Length: 81, dtype: object
# Observe a sample
loans.head(10)
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperRating (Alpha) | ProsperScore | ListingCategory (numeric) | BorrowerState | Occupation | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CurrentlyInGroup | GroupKey | DateCreditPulled | CreditScoreRangeLower | CreditScoreRangeUpper | FirstRecordedCreditLine | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | LoanKey | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | LoanOriginationDate | LoanOriginationQuarter | MemberKey | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | CO | Other | Self-employed | 2.0 | True | True | NaN | 2007-08-26 18:41:46.780000000 | 640.0 | 659.0 | 2001-10-11 00:00:00 | 5.0 | 4.0 | 12.0 | 1 | 24.0 | 3.0 | 3.0 | 2.0 | 472.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.00 | 1500.0 | 11.0 | 0.81 | 0.0 | 0.17 | $25,000-49,999 | True | 3083.333333 | E33A3400205839220442E84 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 78 | 19141 | 9425 | 2007-09-12 00:00:00 | Q3 2007 | 1F3E3376408759268057EDA | 330.43 | 11396.14 | 9425.00 | 1971.14 | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | 0.07960 | 0.0249 | 0.05470 | 6.0 | A | 7.0 | 2 | CO | Professional | Employed | 44.0 | False | False | NaN | 2014-02-27 08:28:14 | 680.0 | 699.0 | 1996-03-18 00:00:00 | 14.0 | 14.0 | 29.0 | 13 | 389.0 | 3.0 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3989.0 | 0.21 | 10266.0 | 29.0 | 1.00 | 2.0 | 0.18 | $50,000-74,999 | True | 6125.000000 | 9E3B37071505919926B1D82 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 | 134815 | 10000 | 2014-03-03 00:00:00 | Q1 2014 | 1D13370546739025387B2F4 | 318.93 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | GA | Other | Not available | NaN | False | True | 783C3371218786870A73D20 | 2007-01-02 14:09:10.060000000 | 480.0 | 499.0 | 2002-07-27 00:00:00 | NaN | NaN | 3.0 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.06 | Not displayed | True | 2083.333333 | 6954337960046817851BCB2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 86 | 6466 | 3001 | 2007-01-17 00:00:00 | Q1 2007 | 5F7033715035555618FA612 | 123.32 | 4186.63 | 3001.00 | 1185.63 | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | 0.08490 | 0.0249 | 0.06000 | 6.0 | A | 9.0 | 16 | GA | Skilled Labor | Employed | 113.0 | True | False | NaN | 2012-10-22 11:02:32 | 800.0 | 819.0 | 1983-02-28 00:00:00 | 5.0 | 5.0 | 29.0 | 7 | 115.0 | 0.0 | 1.0 | 4.0 | 10056.0 | 14.0 | 0.0 | 0.0 | 1444.0 | 0.04 | 30754.0 | 26.0 | 0.76 | 0.0 | 0.15 | $25,000-49,999 | True | 2875.000000 | A0393664465886295619C51 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 16 | 77296 | 10000 | 2012-11-01 00:00:00 | Q4 2012 | 9ADE356069835475068C6D2 | 321.45 | 5143.20 | 4091.09 | 1052.11 | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | 0.18316 | 0.0925 | 0.09066 | 3.0 | D | 4.0 | 2 | MN | Executive | Employed | 44.0 | True | False | NaN | 2013-09-14 18:38:44 | 680.0 | 699.0 | 2004-02-20 00:00:00 | 19.0 | 19.0 | 49.0 | 6 | 220.0 | 1.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6193.0 | 0.81 | 695.0 | 39.0 | 0.95 | 2.0 | 0.26 | $100,000+ | True | 9583.333333 | A180369302188889200689E | 1.0 | 11.0 | 11.0 | 0.0 | 0.0 | 11000.0 | 9947.9 | NaN | 0 | NaN | 6 | 102670 | 15000 | 2013-09-20 00:00:00 | Q3 2013 | 36CE356043264555721F06C | 563.97 | 2819.85 | 1563.22 | 1256.63 | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
| 5 | 0F05359734824199381F61D | 1074836 | 2013-12-14 08:26:37.093000000 | NaN | 60 | Current | NaN | 0.15425 | 0.1314 | 0.1214 | 0.11567 | 0.0449 | 0.07077 | 5.0 | B | 10.0 | 1 | NM | Professional | Employed | 82.0 | True | False | NaN | 2013-12-14 08:26:40 | 740.0 | 759.0 | 1973-03-01 00:00:00 | 21.0 | 17.0 | 49.0 | 13 | 1410.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 62999.0 | 0.39 | 86509.0 | 47.0 | 1.00 | 0.0 | 0.36 | $100,000+ | True | 8333.333333 | C3D63702273952547E79520 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 3 | 123257 | 15000 | 2013-12-24 00:00:00 | Q4 2013 | 874A3701157341738DE458F | 342.37 | 679.34 | 351.89 | 327.45 | -25.33 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 6 | 0F0A3576754255009D63151 | 750899 | 2013-04-12 09:52:56.147000000 | NaN | 36 | Current | NaN | 0.31032 | 0.2712 | 0.2612 | 0.23820 | 0.1275 | 0.11070 | 2.0 | E | 2.0 | 1 | KS | Sales - Retail | Employed | 172.0 | False | False | NaN | 2013-04-12 09:52:53 | 680.0 | 699.0 | 2000-09-29 00:00:00 | 10.0 | 7.0 | 20.0 | 6 | 214.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5812.0 | 0.72 | 1929.0 | 16.0 | 0.68 | 0.0 | 0.27 | $25,000-49,999 | True | 2083.333333 | CE963680102927767790520 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 11 | 88353 | 3000 | 2013-04-18 00:00:00 | Q2 2013 | AA4535764146102879D5959 | 122.67 | 1226.70 | 604.25 | 622.45 | -22.95 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 7 | 0F1035772717087366F9EA7 | 768193 | 2013-05-05 06:49:27.493000000 | NaN | 36 | Current | NaN | 0.23939 | 0.2019 | 0.1919 | 0.17830 | 0.0799 | 0.09840 | 4.0 | C | 4.0 | 2 | CA | Laborer | Employed | 103.0 | False | False | NaN | 2013-05-05 06:49:25 | 700.0 | 719.0 | 1999-02-25 00:00:00 | 6.0 | 6.0 | 10.0 | 5 | 101.0 | 3.0 | 16.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1260.0 | 0.13 | 2181.0 | 10.0 | 0.80 | 0.0 | 0.24 | $25,000-49,999 | True | 3355.750000 | 0C87368108902149313D53B | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 10 | 90051 | 10000 | 2013-05-13 00:00:00 | Q2 2013 | 737F347089545035681C074 | 372.60 | 3353.40 | 1955.89 | 1397.51 | -69.21 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 8 | 0F043596202561788EA13D5 | 1023355 | 2013-12-02 10:43:39.117000000 | NaN | 36 | Current | NaN | 0.07620 | 0.0629 | 0.0529 | 0.05221 | 0.0099 | 0.04231 | 7.0 | AA | 9.0 | 7 | IL | Food Service | Employed | 269.0 | True | False | NaN | 2013-12-02 10:43:39 | 820.0 | 839.0 | 1993-04-01 00:00:00 | 17.0 | 16.0 | 32.0 | 12 | 219.0 | 1.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9906.0 | 0.11 | 77696.0 | 29.0 | 1.00 | 1.0 | 0.25 | $25,000-49,999 | True | 3333.333333 | 02163700809231365A56A1C | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 3 | 121268 | 10000 | 2013-12-12 00:00:00 | Q4 2013 | 49A53699682291323D04D66 | 305.54 | 611.08 | 505.58 | 105.50 | -16.77 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 9 | 0F043596202561788EA13D5 | 1023355 | 2013-12-02 10:43:39.117000000 | NaN | 36 | Current | NaN | 0.07620 | 0.0629 | 0.0529 | 0.05221 | 0.0099 | 0.04231 | 7.0 | AA | 11.0 | 7 | IL | Food Service | Employed | 269.0 | True | False | NaN | 2013-12-02 10:43:39 | 820.0 | 839.0 | 1993-04-01 00:00:00 | 17.0 | 16.0 | 32.0 | 12 | 219.0 | 1.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9906.0 | 0.11 | 77696.0 | 29.0 | 1.00 | 1.0 | 0.25 | $25,000-49,999 | True | 3333.333333 | 02163700809231365A56A1C | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 3 | 121268 | 10000 | 2013-12-12 00:00:00 | Q4 2013 | 49A53699682291323D04D66 | 305.54 | 611.08 | 505.58 | 105.50 | -16.77 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
# Convert listing category from numeric to categorical values
ListingCategory = {
0: 'Not Available', 1: 'Debt Consolidation', 2: 'Home Improvement', 3: 'Business', 4: 'Personal Loan', 5: 'Student Use',
6: 'Auto', 7: 'Other', 8: 'Baby&Adoption', 9: 'Boat', 10: 'Cosmetic Procedure',
11: 'Engagement Ring', 12: 'Green Loans', 13: 'Household Expenses', 14: 'Large Purchases', 15: 'Medical/Dental',
16: 'Motorcycle', 17: 'RV', 18: 'Taxes', 19: 'Vacation', 20: 'Wedding Loans'
}
loans['ListingCategory'] = loans['ListingCategory (numeric)'].apply(lambda x: ListingCategory[x])
loans = loans.drop(['ListingCategory (numeric)'], axis=1)
# Convert unordered variables to categorical columns
loans['EmploymentStatus'] = loans['EmploymentStatus'].astype('category')
loans['ListingCategory'] = loans['ListingCategory'].astype('category')
loans['BorrowerState'] = loans['BorrowerState'].astype('category')
loans['LoanStatus'] = loans['LoanStatus'].astype('category')
loans['Occupation'] = loans['Occupation'].astype('category')
loans['Term'] = loans['Term'].astype('category')
un_ordinal_var_dict = {
'EmploymentStatus': ['Self-employed', 'Employed', 'Full-time', 'Other', 'Not employed', 'Part-time', 'Retired'],
'ListingCategory': [ 'Home Improvement', 'Motorcycle', 'Debt Consolidation', 'Other', 'Household Expenses', 'Auto',
'Medical/Dental', 'Wedding Loans', 'Vacation', 'Business', 'Taxes', 'Baby&Adoption', 'Personal Loan', 'Engagement Ring',
'Large Purchases', 'Student Use', 'Boat', 'RV', 'Cosmetic Procedure', 'Green Loans'],
'BorrowerState': ['CO', 'GA', 'MN', 'NM', 'KS', 'CA', 'IL', 'MD', 'AL', 'AZ',
'VA', 'FL', 'PA', 'OR', 'MI', 'NY', 'LA', 'WI', 'OH', 'NC', 'WA',
'NV', 'NJ', 'TX', 'AR', 'SC', 'DE', 'MO', 'NE', 'UT', 'DC', 'MA',
'CT', 'IN', 'KY', 'OK', 'MS', 'WV', 'RI', 'TN', 'ID', 'MT', 'HI',
'NH', 'VT', 'WY', 'ME', 'AK', 'IA', 'SD', 'ND'],
'LoanStatus': ['Completed', 'Current', 'Past Due (1-15 days)', 'Defaulted',
'Chargedoff', 'Past Due (16-30 days)', 'Cancelled',
'Past Due (61-90 days)', 'Past Due (31-60 days)',
'Past Due (91-120 days)', 'FinalPaymentInProgress',
'Past Due (>120 days)'],
'Occupation': ['Other', 'Professional', 'Skilled Labor', 'Executive',
'Sales - Retail', 'Laborer', 'Food Service', 'Fireman',
'Waiter/Waitress', 'Construction', 'Computer Programmer',
'Sales - Commission', 'Retail Management', 'Engineer - Mechanical',
'Military Enlisted', 'Clerical', 'Teacher', 'Clergy',
'Accountant/CPA', 'Attorney', 'Nurse (RN)', 'Analyst',
"Nurse's Aide", 'Investor', 'Realtor', 'Flight Attendant',
'Nurse (LPN)', 'Military Officer', 'Food Service Management',
'Truck Driver', 'Administrative Assistant',
'Police Officer/Correction Officer', 'Social Worker',
'Tradesman - Mechanic', 'Medical Technician', 'Professor',
'Postal Service', 'Civil Service', 'Pharmacist',
'Tradesman - Electrician', 'Scientist', 'Dentist',
'Engineer - Electrical', 'Architect', 'Landscaping',
'Tradesman - Carpenter', 'Bus Driver', 'Tradesman - Plumber',
'Engineer - Chemical', 'Doctor', 'Chemist',
'Student - College Senior', 'Principal', "Teacher's Aide",
'Pilot - Private/Commercial', 'Religious', 'Homemaker',
'Student - College Graduate Student', 'Student - Technical School',
'Psychologist', 'Biologist', 'Student - College Sophomore',
'Judge', 'Student - College Junior', 'Car Dealer',
'Student - Community College', 'Student - College Freshman']
}
for var in un_ordinal_var_dict:
un_ordered_var = pd.api.types.CategoricalDtype(categories = un_ordinal_var_dict[var])
loans[var] = loans[var].astype(un_ordered_var)
# Convert date columns to type datetime
loans['ListingCreationDate'] = pd.to_datetime(loans['ListingCreationDate'].astype(str))
loans['ClosedDate'] = pd.to_datetime(loans['ClosedDate'].astype(str))
loans['DateCreditPulled'] = pd.to_datetime(loans['DateCreditPulled'].astype(str))
loans['LoanOriginationDate'] = pd.to_datetime(loans['LoanOriginationDate'].astype(str))
loans['FirstRecordedCreditLine'] = pd.to_datetime(loans['FirstRecordedCreditLine'].astype(str))
# convert cut, color, and clarity into ordered categorical types
ordinal_var_dict = {
'ProsperRating (Alpha)': ['HR','E','D','C','B','A','AA'],
'CreditGrade': ['HR','E','D','C','B','A','AA'],
'IncomeRange': ['Not employed', '$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+'],
'LoanOriginationQuarter': [
'Q4 2005',
'Q1 2006','Q2 2006','Q3 2006','Q4 2006',
'Q1 2007','Q2 2007','Q3 2007','Q4 2007',
'Q1 2008','Q2 2008','Q3 2008','Q4 2008',
'Q2 2009','Q3 2009','Q4 2009',
'Q1 2010','Q2 2010','Q3 2010','Q4 2010',
'Q1 2011','Q2 2011','Q3 2011','Q4 2011',
'Q1 2012','Q2 2012','Q3 2012','Q4 2012',
'Q1 2013','Q2 2013','Q3 2013','Q4 2013',
'Q1 2014'
]
}
for var in ordinal_var_dict:
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
categories = ordinal_var_dict[var])
loans[var] = loans[var].astype(ordered_var)
loans.describe()
| ListingNumber | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperScore | EmploymentStatusDuration | CreditScoreRangeLower | CreditScoreRangeUpper | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | StatedMonthlyIncome | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.139370e+05 | 113912.000000 | 113937.000000 | 113937.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 106312.000000 | 113346.000000 | 113346.000000 | 106333.000000 | 106333.000000 | 113240.000000 | 113937.00000 | 113937.000000 | 113240.000000 | 112778.000000 | 113240.000000 | 106315.000000 | 112947.000000 | 113240.000000 | 106333.000000 | 1.063330e+05 | 106333.000000 | 106393.000000 | 106393.000000 | 106393.000000 | 106393.000000 | 105383.000000 | 1.139370e+05 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 18928.000000 | 113937.000000 | 16952.000000 | 113937.000000 | 113937.000000 | 113937.00000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 |
| mean | 6.278857e+05 | 0.218828 | 0.192764 | 0.182701 | 0.168661 | 0.080306 | 0.096068 | 4.072243 | 5.950067 | 96.071582 | 685.567731 | 704.567731 | 10.317192 | 9.260164 | 26.754539 | 6.96979 | 398.292161 | 1.435085 | 5.584405 | 0.592052 | 984.507059 | 4.154984 | 0.312646 | 0.015094 | 1.759871e+04 | 0.561309 | 11210.225447 | 23.230034 | 0.885897 | 0.802327 | 0.275947 | 5.608026e+03 | 1.421100 | 22.934345 | 22.271949 | 0.613629 | 0.048540 | 8472.311961 | 2930.313906 | -3.223214 | 152.816539 | 16.268464 | 31.896882 | 69444.474271 | 8337.01385 | 272.475783 | 4183.079489 | 3105.536588 | 1077.542901 | -54.725641 | -14.242698 | 700.446342 | 681.420499 | 25.142686 | 0.998584 | 0.048027 | 0.023460 | 16.550751 | 80.475228 |
| std | 3.280762e+05 | 0.080364 | 0.074818 | 0.074516 | 0.068467 | 0.046764 | 0.030403 | 1.673227 | 2.376501 | 94.480605 | 66.458275 | 66.458275 | 5.457866 | 5.022644 | 13.637871 | 4.63097 | 447.159711 | 2.437507 | 6.429946 | 1.978707 | 7158.270157 | 10.160216 | 0.727868 | 0.154092 | 3.293640e+04 | 0.317918 | 19818.361309 | 11.871311 | 0.148179 | 1.097637 | 0.551759 | 7.478497e+03 | 0.764042 | 19.249584 | 18.830425 | 2.446827 | 0.556285 | 7395.507650 | 3806.635075 | 50.063567 | 466.320254 | 9.005898 | 29.974184 | 38930.479610 | 6245.80058 | 192.697812 | 4790.907234 | 4069.527670 | 1183.414168 | 60.675425 | 109.232758 | 2388.513831 | 2357.167068 | 275.657937 | 0.017919 | 0.332353 | 0.232412 | 294.545422 | 103.239020 |
| min | 4.000000e+00 | 0.006530 | 0.000000 | -0.010000 | -0.182700 | 0.004900 | -0.182700 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 19.000000 | 0.000000 | 0.000000 | 2.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -209.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1000.00000 | 0.000000 | -2.349900 | 0.000000 | -2.349900 | -664.870000 | -9274.750000 | -94.200000 | -954.550000 | 0.000000 | 0.700000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 4.009190e+05 | 0.156290 | 0.134000 | 0.124200 | 0.115670 | 0.042400 | 0.074080 | 3.000000 | 4.000000 | 26.000000 | 660.000000 | 679.000000 | 7.000000 | 6.000000 | 17.000000 | 4.00000 | 114.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.121000e+03 | 0.310000 | 880.000000 | 15.000000 | 0.820000 | 0.000000 | 0.140000 | 3.200333e+03 | 1.000000 | 9.000000 | 9.000000 | 0.000000 | 0.000000 | 3500.000000 | 0.000000 | -35.000000 | 0.000000 | 9.000000 | 6.000000 | 37332.000000 | 4000.00000 | 131.620000 | 1005.760000 | 500.890000 | 274.870000 | -73.180000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 50% | 6.005540e+05 | 0.209760 | 0.184000 | 0.173000 | 0.161500 | 0.072400 | 0.091700 | 4.000000 | 6.000000 | 67.000000 | 680.000000 | 699.000000 | 10.000000 | 9.000000 | 25.000000 | 6.00000 | 271.000000 | 1.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.549000e+03 | 0.600000 | 4100.000000 | 22.000000 | 0.940000 | 0.000000 | 0.220000 | 4.666667e+03 | 1.000000 | 16.000000 | 15.000000 | 0.000000 | 0.000000 | 6000.000000 | 1626.550000 | -3.000000 | 0.000000 | 14.000000 | 21.000000 | 68599.000000 | 6500.00000 | 217.740000 | 2583.830000 | 1587.500000 | 700.840100 | -34.440000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 44.000000 |
| 75% | 8.926340e+05 | 0.283810 | 0.250000 | 0.240000 | 0.224300 | 0.112000 | 0.116600 | 5.000000 | 8.000000 | 137.000000 | 720.000000 | 739.000000 | 13.000000 | 12.000000 | 35.000000 | 9.00000 | 525.000000 | 2.000000 | 7.000000 | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.952100e+04 | 0.840000 | 13180.000000 | 30.000000 | 1.000000 | 1.000000 | 0.320000 | 6.825000e+03 | 2.000000 | 33.000000 | 32.000000 | 0.000000 | 0.000000 | 11000.000000 | 4126.720000 | 25.000000 | 0.000000 | 22.000000 | 65.000000 | 101901.000000 | 12000.00000 | 371.580000 | 5548.400000 | 4000.000000 | 1458.540000 | -13.920000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 115.000000 |
| max | 1.255725e+06 | 0.512290 | 0.497500 | 0.492500 | 0.319900 | 0.366000 | 0.283700 | 7.000000 | 11.000000 | 755.000000 | 880.000000 | 899.000000 | 59.000000 | 54.000000 | 136.000000 | 51.00000 | 14985.000000 | 105.000000 | 379.000000 | 83.000000 | 463881.000000 | 99.000000 | 38.000000 | 20.000000 | 1.435667e+06 | 5.950000 | 646285.000000 | 126.000000 | 1.000000 | 20.000000 | 10.010000 | 1.750003e+06 | 8.000000 | 141.000000 | 141.000000 | 42.000000 | 21.000000 | 72499.000000 | 23450.950000 | 286.000000 | 2704.000000 | 44.000000 | 100.000000 | 136486.000000 | 35000.00000 | 2251.510000 | 40702.390000 | 35000.000000 | 15617.030000 | 32.060000 | 0.000000 | 25000.000000 | 25000.000000 | 21117.900000 | 1.012500 | 39.000000 | 33.000000 | 25000.000000 | 1189.000000 |
loans.describe(include=['category'])
| CreditGrade | Term | LoanStatus | ProsperRating (Alpha) | BorrowerState | Occupation | EmploymentStatus | IncomeRange | LoanOriginationQuarter | ListingCategory | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 28812 | 113937 | 113937 | 84853 | 108422 | 110349 | 106335 | 106196 | 113937 | 96972 |
| unique | 7 | 3 | 12 | 7 | 51 | 67 | 7 | 7 | 33 | 20 |
| top | C | 36 | Current | C | CA | Other | Employed | $25,000-49,999 | Q4 2013 | Debt Consolidation |
| freq | 5649 | 87778 | 56576 | 18345 | 14717 | 28617 | 67322 | 32192 | 14450 | 58308 |
loans.describe(include=['object'])
| ListingKey | GroupKey | LoanKey | MemberKey | |
|---|---|---|---|---|
| count | 113937 | 13341 | 113937 | 113937 |
| unique | 113066 | 706 | 113066 | 90831 |
| top | 17A93590655669644DB4C06 | 783C3371218786870A73D20 | CB1B37030986463208432A1 | 63CA34120866140639431C9 |
| freq | 6 | 1140 | 6 | 9 |
There are 113,937 loans in the dataset
The Proper Rating and Credit Grade follow thesame ordered grading system.
The Prosper Rating and Credit Grade both represent Loan Grades for loan pre and post 2009.
There are also other categorical values such as: LoanStatus, BorrowerState, Occupation, EmploymentStatus, IncomeRange, LoanOriginationQuarter, Term, ListingCategory
Most of the remaining variables are numeric.
My interest is focused on the loan status more specifically to analyze the features that affect loan default rates.
The dataset has a large number of features in the dataset but for this analysis which makes it tricky to perform a full blown analysis. I would focus only
I think the most interesting features for this investigation will be the CreditGrade, ProsperRating (Alpha), Occupation, EmploymentStatus, IncomeRange, ListingCategory which will have the most impact on the loans score. Other numeric variables too such as the Loan Amount will also have an impact on the loan scores.
First I'll look at the distribution of the Loan Status
plt.figure(figsize=[8, 5])
default_color = sb.color_palette()[0]
sb.countplot(data=loans, x='LoanStatus', color=default_color)
plt.yscale('log')
plt.xlabel('Loan Status')
plt.yticks([1e1, 1e2, 1e3, 1e4], [10, 100, '1k', '10k'])
plt.xticks(rotation=30, ha='right')
plt.show()
The Loan Status has a lot of categories which aren't very relevant to my subject of interest and can be simplified for better visibility in the observations.
For this, I will simplify the loan status into 4 categories: Completed, Current, Defaulted and Overdue.
m = {
'Completed': 'Completed', 'FinalPaymentInProgress': 'Completed', 'Current': 'Current',
'Defaulted': 'Defaulted', 'Chargedoff': 'Defaulted', 'Cancelled': 'Defaulted',
'Past Due (1-15 days)': 'Overdue', 'Past Due (16-30 days)': 'Overdue', 'Past Due (61-90 days)': 'Overdue',
'Past Due (31-60 days)': 'Overdue', 'Past Due (91-120 days)': 'Overdue', 'Past Due (>120 days)': 'Overdue'
}
loans['LoanStatus'] = loans['LoanStatus'].map(m).astype('category')
plt.figure(figsize=[8, 5])
default_color = sb.color_palette()[0]
sb.countplot(data=loans, x='LoanStatus', color=default_color)
plt.xlabel('Loan Status')
plt.show()
From the plot, we can see that the Current Loans take up most of the listings available and a very small proportion are the overdue loans. Current and Overdue loans as not useful for my analysis and to minimize clutter in my visualizations, I will drop them.
loans = loans[loans['LoanStatus'].isin(['Completed', 'Defaulted'])]
loans['LoanStatus'] = loans['LoanStatus'].astype('str').astype('category')
plt.figure(figsize=[5, 5])
default_color = sb.color_palette()[0]
sb.countplot(data=loans, x='LoanStatus', color=default_color)
plt.xlabel('Loan Status')
plt.show()
loan_status_count = loans['LoanStatus'].value_counts()
loan_status_percent = loans['LoanStatus'].value_counts(normalize=True) * 100
print(f"Loan Status Counts : \n{loan_status_count}")
print("\n")
print(f"Loan Status Percent : \n{loan_status_percent}")
Loan Status Counts : Completed 38279 Defaulted 17015 Name: LoanStatus, dtype: int64 Loan Status Percent : Completed 69.228126 Defaulted 30.771874 Name: LoanStatus, dtype: float64
Next, I will observe the categorical values to see how they are distributed
# Plotting categorical variables together to get an idea of each ordinal variable's distribution.
fig, ax = plt.subplots(nrows=3, ncols=2, figsize = [16,10])
default_color = sb.color_palette()[0]
sb.countplot(data = loans, x = 'ProsperRating (Alpha)', color = default_color, ax = ax[0,0])
sb.countplot(data = loans, x = 'CreditGrade', color = default_color, ax = ax[1,0])
sb.countplot(data = loans, x = 'IncomeRange', color = default_color, ax = ax[2,0])
sb.countplot(data = loans, x = 'Term', color = default_color, ax = ax[0,1])
sb.countplot(data = loans, x = 'EmploymentStatus', color = default_color, ax = ax[1,1])
sb.countplot(data = loans, x = 'ListingCategory', color = default_color, ax = ax[2,1])
plt.setp(ax[2,0].get_xticklabels(), rotation=30, ha="right", rotation_mode="anchor")
plt.setp(ax[2,1].get_xticklabels(), rotation=30, ha="right", rotation_mode="anchor")
plt.tight_layout()
plt.show()
From the datad dictionary, the Credit Grade and Prosper Rating both represent the loan grade before and after 2009. These variable can be merged as one to better understand the visualisations and analysis
# First combine the CreditGrade and ProsperRating into 1 variable, LoanGrade.
from pandas import CategoricalDtype
loans['LoanGrade'] = loans['CreditGrade'].combine_first(loans['ProsperRating (Alpha)'])
loans['LoanGrade'] = loans['LoanGrade'].astype(CategoricalDtype(ordered=True, categories=['HR','E','D','C','B','A','AA']))
The employment status too is quite interesting but has too many categories that potentially mean thesame in the case of our analysis and would just clutter my observations. The best thinwould be to simplify these categories into Employed, Unemployed, Retired and Other
# Simplify the employment status into `Employed`, `Unemployed`, `Retired` and `Other`
m = {
'Self-employed': 'Employed', 'Full-time': 'Employed', 'Part-time': 'Employed', 'Employed': 'Employed',
'Not employed': 'Unemployed', 'Retired': 'Retired', 'Other': 'Other'
}
loans['EmploymentStatus'] = loans['EmploymentStatus'].map(m).astype('category')
From the plot, we can also see that most listings were in the income range of 25000 to 75000. There are also big income earners within the income range greater than 75000. A smaller part of the distribution have incomes less than 25000. The Not Employed range and $0 earners can be considered thesame.
# Joins `Not Employed` and `$0` income range.
m = {
'Not employed': '$0',
'$0': '$0',
'$1-24,999': '$1-24,999',
'$25,000-49,999': '$25,000-49,999',
'$50,000-74,999': '$50,000-74,999',
'$75,000-99,999': '$75,000-99,999',
'$100,000+': '$100,000+'
}
loans['IncomeRange'] = loans['IncomeRange'].map(m).astype(CategoricalDtype(ordered=True,
categories=['$0', '$1-24,999', '$25,000-49,999',
'$50,000-74,999', '$75,000-99,999', '$100,000+']))
# Replotting the simplified categories
fig, ax = plt.subplots(ncols=3, figsize = [20,5])
default_color = sb.color_palette()[0]
sb.countplot(data = loans, x = 'LoanGrade', color = default_color, ax = ax[0])
sb.countplot(data = loans, x = 'EmploymentStatus', color = default_color, ax = ax[1])
sb.countplot(data = loans, x = 'IncomeRange', color = default_color, ax = ax[2])
ax[0].set_xlabel("Loan Grade")
ax[1].set_xlabel("Employment Status")
ax[2].set_xlabel("Income Range")
plt.setp(ax[2].get_xticklabels(), rotation=30, ha="right", rotation_mode="anchor")
plt.show()
The loan grade shows an interesting distribution. It can be observed that loans in the grade D and C are very popular. The employment status on the other hand mostly has employed values and close to no values in the other categories. The income range too is shows that most listings were in the income range of 25000 to 75000 with significantly little 0 income earners.
# Bar plot of occupation
plt.figure(figsize=[20, 5])
default_color = sb.color_palette()[0]
sb.countplot(data=loans, x='Occupation', color=default_color)
plt.xticks(rotation=45, ha='right', rotation_mode='anchor')
plt.xlabel('Occupation')
plt.show()
The occupation distribution is a little interesting. Though most of the occupations are registered as Other and Professional which is quite vague and doesn't convey any valuable insight for my analysis.
Regardless of that, we can observe that most borrowers were either Accountant, Administrative Assistant, Analyst, Clerical, Computer Programmer, Construction, Executive, Retail Management, Sales - Commission, Sales - Retail, Skilled Labor or Teacher.
I will also look at the loan origination quarter to see the periods with the most listings
# Countplot of Loan Origination Quarter
plt.figure(figsize=[20, 5])
default_color = sb.color_palette()[0]
sb.countplot(data=loans, x='LoanOriginationQuarter', color=default_color)
plt.xticks(rotation=45, ha='right')
plt.xlabel('Loan Origination Quarter')
plt.show()
From the plot we can observe a wave of loans with a spike between Q2 of 2006 to Q4 of 2008 and another slight increase between Q3 2009 to Q1 2014 though the spikle is greater in the former.
I also have in the records information on if the borrower is a home owner, is the income veriable or belongs to a group
# Plotting Boolean Variables together to see the distribution
fig, ax = plt.subplots(ncols=3, figsize = [20,5])
bool_vars = ['IsBorrowerHomeowner', 'CurrentlyInGroup', 'IncomeVerifiable']
for i, var in enumerate(bool_vars):
count = loans[var].value_counts()
ax[i].pie(count, labels=count.index, autopct='%1.1f%%')
ax[0].set_xlabel("Is Borrower Homeower")
ax[1].set_xlabel("Is Borrower In Group")
ax[2].set_xlabel("Is Income Verifiable")
plt.show()
The distribution shows more interesting insights with the home ownership status where there are almost equally distributed. The Group membership status and income verifiable are mostly one sided and would reveal little insight.
Next, I would look at numeric values.
# Plot a standard scaled plot of Loan Amounts
log_binsize = 0.05
bins = 10 ** np.arange(3, np.log10(loans['LoanOriginalAmount'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data=loans, x='LoanOriginalAmount')
plt.show()
# Put the plot on a log scale to better visualize the values on the right
log_binsize = 0.05
bins = 10 ** np.arange(3, np.log10(loans['LoanOriginalAmount'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data=loans, x='LoanOriginalAmount', bins=bins)
plt.xscale('log')
plt.xlim([9e2, 3e4])
plt.xticks([1e3, 2e3, 5e3, 1e4, 2e4], ['1k', '2k', '5k', '10k', '20k'])
plt.yticks([1e3, 2e3, 3e3, 4e3, 5e3, 6e3], ['1k', '2k', '3k', '4k', '5k', '6k'])
plt.show()
The loan amount has an interesting distribution. There are sporadic high spikes showing that some loan amounts were very popular among borrowers with the highest peak in amounts between 3000 and 5000.
# Plotting the borrower rate on a logit scale
binsize = 0.01
bins = np.arange(0.04, 0.4, binsize)
labels = [0.1, 0.2, 0.3, 0.4]
plt.figure(figsize=[8, 5])
plt.hist(data = loans, x = 'BorrowerRate', bins=bins)
plt.xscale('logit')
plt.xticks(labels, labels)
plt.xlabel('Borrower Rate')
plt.show()
# Plotting the borrower APR on a logit scale
binsize = 0.01
bins = np.arange(0.05, 0.5, binsize)
labels = [0.1, 0.2, 0.3, 0.4]
plt.figure(figsize=[8, 5])
plt.hist(data = loans, x = 'BorrowerAPR', bins=bins)
plt.xscale('logit')
plt.xticks(labels, labels)
plt.xlabel('Borrower APR')
plt.show()
Plotting the borrower Rate and Borrower APR, I observe that they have a verys similar dsitribution which makes sense as they basically represent thesame value except for slight differences in the Borrower's APR as it not only represents the interest rate but also additional fees and charges included in the rate which makes the APR a better category for observation.
# Plotting the Stated Monthly Income on a log scale
log_binsize = 0.05
bins = 10 ** np.arange(0, np.log10(loans['StatedMonthlyIncome'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data=loans, x='StatedMonthlyIncome', bins=bins)
plt.xscale('log')
plt.xlim([100, 1e5])
plt.xticks([1e2, 1e3, 1e4, 1e5], [100, '1k', '10k', '100k'])
plt.xlabel("Stated Monthly Income")
plt.show()
The plot shows that the monthly income is typically in the range between 2000 and 12000 per month. I would be more interesting to observe the income per annum as this will provide easier to process insights in the data that can easily be correlated to the Income Range categorical variable.
# Create annual income column
loans['AnnualIncome'] = loans['StatedMonthlyIncome'] * 12
# Plotting the annual income on a log scale
log_binsize = 0.05
bins = 10 ** np.arange(0, np.log10(loans['AnnualIncome'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data=loans, x='AnnualIncome', bins=bins)
plt.xscale('log')
plt.xlim([100, 1e7])
plt.xticks([1e2, 1e3, 1e4, 1e5, 1e6], [100, '1k', '10k', '100k', '1M'])
plt.xlabel("Annual Income")
plt.show()
We can see that the most income earners are in the range of 30000 and 80000 per annum which correlates with the income range observed above.
The numeric variables took long range of values and needed scaling to observe their distribtution better. The Annual Income and Loan Amount had to be scaled on a log scale while the Borrower's Interest Rate and APR had to be scaled on a logit scale since the values where distributed between 0 and 1.
The categorical variables had too many details that were hard to observe in a plot and required to be simplified. I did some summarization in the loan status, employment status and income range values to adjust them into smaller and more meaningful categories.
While investigating the Occupation, most values where recorded either as Other or Professional which represents very vague information.
numeric_vars = ['LoanOriginalAmount', 'BorrowerAPR', 'AnnualIncome', 'AmountDelinquent', 'RevolvingCreditBalance']
categoric_vars = ['Term', 'IncomeRange', 'EmploymentStatus', 'LoanGrade', 'LoanStatus']
First I want to investigate any possible correlations between the numeric variables
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(loans[numeric_vars].corr(), annot = True, fmt = '.3f', cmap = 'vlag_r', center = 0)
plt.show()
# plot matrix with a 500 sample for faster and clearer rendering
loans.reset_index(inplace=True, drop=True)
samples = np.random.choice(loans.shape[0], 500, replace = False)
loan_samples = loans.loc[samples,:]
g = sb.PairGrid(data=loan_samples, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter)
plt.show()
From the heatmap and pairgrid, there are no strong correlations between the numeric variables. Their associations are very random.
fig, ax = plt.subplots(ncols=3, figsize = [20, 5])
# subplot 1: IncomeRange vs LoanStatus
sb.countplot(data = loans, x = 'IncomeRange', hue = 'LoanStatus', palette = 'Blues', ax=ax[0])
# subplot 2: LoanGrade vs. LoanStatus
sb.countplot(data = loans, x = 'LoanGrade', hue = 'LoanStatus', palette = 'Blues', ax=ax[1])
# subplot 2: IncomeRange vs. LoanGrade
sb.countplot(data = loans, x = 'IncomeRange', hue = 'LoanGrade', palette = 'Reds', ax=ax[2])
plt.setp(ax[0].get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
plt.setp(ax[2].get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
plt.show()
The loan status distribution with the income range seems balanced with relatively low default rates through out the distribution except for the 0 income earners who have almost equal defaulters and completed loans.
The loan grade vs default rate shows a steady drop in defaults in increase in loan grade. Loans with HR grade were just as likely to default while those with AA where less likely to default.
The plot of the income range with the loan grade is more insightful. We can observe people with a higher income range tend to take loans with with a higher grade. Loans of Grade D is very popular wit people in the 25000 to 50000 income range.
# Observing Occupation against Loan Status
plt.figure(figsize=[20, 5])
loans_occupations = loans[loans['Occupation'] != 'Other']
loans_occupations = loans_occupations[loans_occupations['Occupation'] != 'Professional']
sb.countplot(data=loans_occupations, x='Occupation', hue='LoanStatus', palette='Blues')
plt.xticks(rotation=45, ha="right", rotation_mode="anchor")
plt.show()
The popular occupations such as accountant have low default rates while less popular tend to have a higher default rate.
# Observing Loan Origination Quarter against Loan Status
plt.figure(figsize=[20, 5])
sb.countplot(data=loans, x='LoanOriginationQuarter', hue='LoanStatus', palette='Blues')
plt.xticks(rotation=45, ha="right", rotation_mode="anchor")
plt.show()
The loan origination quarter by loan status plot shows an increase in completed loans from Q3 2009 with no defaulters in Q4 of 2013 and Q1 2014.
# Observing Income Range by Loan Amount
income_range_loans = loans.groupby('IncomeRange').mean()
plt.figure(figsize=[10, 5])
plt.bar(x=income_range_loans.index, height=income_range_loans['LoanOriginalAmount'])
plt.show()
/var/folders/fy/8jb2hm6j5vb8xcb_0pgh11x80000gn/T/ipykernel_6823/2821077832.py:3: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
income_range_loans = loans.groupby('IncomeRange').mean()
Investigating the Loan Amount per Income Range, I could see that people with 0 income tned to have more loans and the need for loans drops with 1-25000 income range and gradually increases with income range.
# Use a Pair Grid of Box Plots to observe the relationship between numeric and categorical variables
loan_samples = loans.sample(n=2000, replace = False)
default_color = sb.color_palette()[0]
def boxgrid(x, y, **kwargs):
sb.boxplot(x=x, y=y, color=default_color)
plt.figure(figsize = [15, 15])
g = sb.PairGrid(data = loan_samples, y_vars = numeric_vars, x_vars = categoric_vars,
height = 3, aspect = 1.5)
g.map(boxgrid)
plt.tight_layout()
plt.show()
<Figure size 1500x1500 with 0 Axes>
# Investigate the numerical and categorical variables relations with violin plots
loan_samples = loans.sample(n=2000, replace = False)
fig, ax = plt.subplots(ncols=len(numeric_vars), nrows=len(categoric_vars) , figsize=[35,25])
for i, category in enumerate(categoric_vars):
for j, n_var in enumerate(numeric_vars):
sb.violinplot(data=loan_samples, x=category, y=n_var, ax=ax[i,j], color=default_color)
plt.tight_layout()
plt.show()
Most of the relationships are fixed with few alterations across categories. The annual income by income range show an interesting distribution with people with a higher income in the higher income range which makes sense. There is an interesting observation between the borrower APR and the loan grade. There is a steady drop in borrower APR with incread in loan grade.
grid = sb.FacetGrid(loans, col='ListingCategory', col_wrap=5)
grid.map(sb.histplot, 'LoanStatus')
plt.show()
Investigating the listing category frequency with the loan status does not prove very insightful especially given the skewd nature of our dataset which contains significantly more completed loans than defaulted loans.
There was an interesting observation with between the loan status and the borrowers APR in the voilin plot. The mean showed that the higher the APR, the higher the chance of default. I also observed loan with larger amounts were taken for longer terms too.
Per yearly quarter, more loans were taken before 2009 and the loan default rate was also significantly higher in that period. After 2009, fewer loans were taken and the loans also had a lower default rate.
There was an expected correlation between the annual income and income range. Most of the other relation ships were mostly stacked on one value across categories or value levels.
I also observed that most Unemployed listings and took more of the smaller loans.
First I will investigate how numerica values interact with each other with respect to the loan status.
# Use PairPlot to Observe correlation between numeracal values of interest and loan status
sb.pairplot(data=loans, vars=['LoanOriginalAmount', 'BorrowerAPR', 'AnnualIncome', 'DebtToIncomeRatio'],
hue='LoanStatus')
plt.show()
The pairplot of numeric variables with the loan status as hue has very interesting insights, though the numeric variables do not have any strong correlations, I could observe that at certain values in the number of defaulters increased. For example, we can observe that at higher borrower APR, there were more defaulters and this observation is also true for variables like the annual income and the debt to income ratio, the
sb.barplot(x='LoanGrade', y='LoanOriginalAmount', data=loans, hue='LoanStatus')
plt.ylabel("Loan Amount")
plt.xlabel("Loan Grade")
plt.show()
A plot of the loan amount with the loan grade shows that high risk loans which is directly proportional to loan amount have high default rates. You can notice this by oberserving the in increase the difference between completed and defaulted loans.
# Use FacetGrid of histPlot to investigate the relationship between ListingCategory, LoanStatus and Loan Amount
grid = sb.FacetGrid(loans, col='ListingCategory', hue='LoanStatus', col_wrap=4)
grid.map(sb.histplot, 'LoanOriginalAmount')
plt.show()
Investigating the loan amount for each loan purpose and observing their default rate shows that loans taken for business had higher default rates with increasing loan amount. This is also true for Home Improvement loans where the default rate increases with increase in loan amount. Loans taken for debt consolidation had the relatively lowest default rates across all amounts.
# Investigate with a violin plot to better observe the rates with the smaller loan amounts and frequency.
plt.figure(figsize = [20, 5])
# Plot violin plot of loan purpose and loan amount by default status
sb.violinplot(x="ListingCategory", y="LoanOriginalAmount", hue="LoanStatus", data=loans, split=True)
plt.xticks(rotation=45, ha="right", rotation_mode="anchor")
plt.show()
The violin plots makes the observations even clearer. There are more defaulter for boat loans with high loan amounts. We also see big shifts to right, that is more higher default rates in household expense, wedding, auto, medical, vacation loans. This could be due to the fact that these are purposes which depreciate in value over time.
# Group data by loan purpose, loan grade, loan term, and default status
grouped = loans.groupby(["LoanGrade", "Term", "LoanStatus"]).size().unstack(fill_value=0)
# Plot clustered bar chart of loan grade, loan term, and default status by loan purpose
grouped.plot(kind="bar", stacked=True)
plt.show()
The investigation of loan status by term and grade shows an increase in default rate with increase in loan term with much higher default rates in low grade loans such the HR, E and D grades.
# Group data by home ownership status, loan purpose, loan grade, and default status
grouped = loans.groupby(["IsBorrowerHomeowner", "LoanGrade", "LoanStatus"]).size().unstack(fill_value=0)
# Plot clustered bar chart of loan purpose, loan grade, and default status by home ownership status
grouped.plot(kind="bar", stacked=True)
plt.show()
Grade C loans are more popular among non-homeowners. However, non-homeowners show a much higher default rate than home owners. Homeowners seem to take higher grade loans and are also very likely to repay the loans.
Observing more than 2 variables at a time definitely showed some more interesting insights. Observing the loan amount with the listing category with loan status revealed that that listing categories with depreciating returns had more defaults compared to loans taken for business or other types of investments.
The multivariate analysis really showed how combining features had effect on loan status. Compared to bivariate analysis which mostly showed little correlations.
In the end, it can be observed that single variables have little impact on loan default rate as observed in the univariate and bivariate analzsis. Coming down to multivariate analysis, it was clearer how combining multiple variables showed clearer patterns in loan default rates.